CPSC 545/445 (Autumn 2003) - Class 12: Gene Finding (2)
Module 4, Part 2

---
recap:
- probabilistic models for gene finding, WMMs and WAMs

---
4.4 Hidden Markov Models (HMMs)

generalisation of WMMs, WAMs, and various other probabilistic models.

HMMs are graphical probabilistic models used in
- modelling time series
- speech recognition
- optical character recognition
- ion channel recordings

in biology (bioinformatics), HMMs can be used for modelling:
- coding/non-coding regions in DNA
- protein binding sites in DNA
- protein superfamilies

--
since the mid-1990s, HMMs + machine learning techniques have been used for
- modelling
- aligning
- analysing
DNA regions and protein families.

--
HMMs are closely related to other formal models:
- neural networks
- stochastic grammars
- Bayesian networks

[example: fig 7.1]

--
An HMM is given by:
- S: finite set of states (e.g., intron, exon)
  convention: use begin + end states B, E
- A: discrete alphabet of symbols (A,T,C,G)
- T = (t_ij): probability transition matrix
  -> transition probabilities t_ij between states i, j
- E = (e_ix): probability emission matrix
  -> emission probabilities e_ix for symbol x in state i

[slide: Figure 7.1]

--
How does it work?
- the system randomly evolves from state to state while emitting symbols from the alphabet
- emission and transition depend on the current state only (Markov assumption)
- only the emitted symbols are observable; the states are hidden

[example 1]

--
problems related to HMMs:
- how likely is a given sequence given an HMM?
  -> given model, observation sequence O: what is the probability of O? (likelihood)
- what is the most probable sequence of transitions and emissions given a sequence?
  -> given model, observation sequence O: what is the most probable sequence of states & emissions? (decoding)
- if the transition and emission probabilities are not known, how can they be learned?
  -> given uncertain model m, observation sequence O: how should m be revised based on O?
(learning)

--
computation of likelihood P(O | model)

naive method:
can easily determine P(O, path | model) for given O, path, model:

  O = o_1 ... o_L
  p = p_0 p_1 p_2 ... p_L p_{L+1},  with p_0 = B, p_{L+1} = E

  P(O, p | model) = t_{B,p_1} \prod_{i=1..L} e_{p_i,o_i} t_{p_i,p_{i+1}}

based on this:

  P(O | model) = \sum_{all paths p} P(O, p | model)

-> summing over all paths leads to exponential worst-case complexity
   (why? because there may be exponentially many paths of length L)

forward algorithm (for determining likelihood)
- key insight: to 'reconstruct' the observation sequence, we only need to compute/store the probability of being in state k after emitting the sequence o_1 ... o_i
- recursively compute f_k(i) = P(o_1...o_i, p_i = k | model) using
    f_l(i+1) = e_{l,o_{i+1}} \sum_{all states k} f_k(i) t_{k,l}
- compute this in one forward pass through O
- note: f_E(L+1) = P(o_1...o_L, p_{L+1} = E | model) = P(O | model)
- complexity: O(L * N^2) (L: length of O, N: number of states)

[example 2]

--
most likely state at any given time

  P(p_i = k | O, model)

note: P(p_i = k | O, model) = P(p_i = k, O | model) / P(O | model)
[according to Bayes' rule]
obtain P(O | model) from the forward algorithm

  P(p_i = k, O | model) = P(o_1...o_i, p_i = k | model) * P(o_{i+1}...o_L | p_i = k, model)

note: P(o_1...o_i, p_i = k | model) = f_k(i) from the forward algorithm above
can compute b_k(i) := P(o_{i+1}...o_L | p_i = k, model) from a backward sweep through O, very similarly to f_k(i)
-> reversed forward algorithm (backward algorithm) [see Durbin et al., p.59]

--
decoding (given HMM and observation sequence, find the most probable state path):
-> Viterbi algorithm (dynamic programming algorithm, conceptually similar to the sequence alignment algorithms from Module 2 and very similar to the forward algorithm above; see also Durbin et al., p.55ff)

--
learning:
- expectation maximisation (Baum-Welch)
  initialise with arbitrary model parameters;
  uses the forward and backward algorithms to iteratively modify the model (see Durbin et al., Ch.3)
  -> maximise likelihood of model given data
- gradient descent
- Viterbi learning
  similar to Baum-Welch, but
re-estimate parameters based on the most probable paths (as determined by the Viterbi algorithm);
typically faster, but gives slightly worse results than EM/Baum-Welch

--
The standard HMM architecture [time permitting]
- used for modelling related families, e.g., families of proteins or RNAs, but also introns/exons, ...
- use one state for each sequence position
- model insertions and deletions by separate states
- deletion ('gap') states are silent (no emissions)
- self-loops in insertion states can model variable-length insertions
- direct transitions between deletion states model variable-length deletions
-> large number of parameters; training can be difficult if relatively few sequences are given

[slide: Figure 7.2]

---
4.5 Using HMMs for Gene Finding

individual signal detectors
probabilistic integration of various signals:
- promoter regions
- translation start & stop context sequences
- reading frame periodicity
- polyadenylation signals
- intron splicing signals
- compositional contrast between introns / exons
- differences in nucleosome positioning signals
- sequence determinants of topological domains (scaffold attachment regions, SARs)

---
Genscan (Burge & Karlin, 1997)
- based on a probabilistic model of gene structure in human genomic sequence
- emphasis on features recognised by the general transcriptional, splicing, and translational machinery, e.g., TATA box, cap site in eukaryotic promoters (rather than signals specific to particular genes)
- does not use similarity search
- overall model similar to a generalised HMM (explicit state duration HMM)
- uses an explicitly double-stranded genomic sequence model
  -> potential genes on both strands are analysed simultaneously
- covers cases where the input sequence contains no gene, a partial gene, a complete gene, or multiple genes
- uses WMMs and maximal dependency decomposition (MDD) to model functional signals
- cannot handle overlapping transcription units
- does not address alternative splicing

[Figure GS] E_k, I_k = phase k internal
exons, introns

--
signal models used by Genscan:
- WMMs for transcriptional and translational signals (translation initiation, polyadenylation signals, TATA box, etc.)
  probabilities estimated from annotated GenBank data
- maximal dependency decomposition for splice signals (WMMs and WAMs are inadequate here)
  probabilistic composition of conditional WMMs

--
exon models and non-coding state models used by Genscan:
- probabilistic models based on conditional hexamer frequencies
- a consistent reading frame is maintained throughout a gene

---
HMMgene (Krogh, 1997)
- different approach from Genscan: rather than modelling individual functional elements and combining them into a big model, the combined model is estimated directly from labelled sequence data
- based on class HMMs (CHMMs - HMMs whose states are labelled and emit symbol + label)
- uses a clever machine learning algorithm for estimating a CHMM from sequence data such that the probability of correct labelling is maximised

important features:
- emission probabilities of states can depend on the n previous states
- allows states to share emission/transition probabilities (tying)

---
Resources:

Durbin et al., Ch.3 [introduction to HMMs]
Baldi & Brunak, Ch.7 [introduction to HMMs]
A. Krogh: Two methods for improving performance of an HMM and their application for gene finding. Proc. 5th Int. Conf. Intelligent Systems for Mol. Biol., 179-186, 1997.
D. Haussler: Computational Genefinding. [online, from his webpage: www.cse.ucsc.edu/~haussler]
J.W. Fickett: The gene identification problem: an overview for developers. Computers Chem. 20(1): 103-118, 1996.
R. Guigo: Computational gene identification: an open problem. Computers Chem. 21(4): 215-222, 1997.

---
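appendix: the forward algorithm from Section 4.4 can be sketched in a few lines of Python. The two-state HMM below (an 'exon-like' state E and an 'intron-like' state I, with made-up transition and emission probabilities) is purely illustrative and not from the lecture; for simplicity the sketch sums over all final states instead of using an explicit end state. The brute-force sum over all paths is included only to show that the O(L * N^2) recursion gives the same likelihood as the exponential naive method.

```python
import itertools

# Toy two-state HMM (illustrative parameters, not from the lecture).
STATES = ['E', 'I']
TRANS = {'B': {'E': 0.5, 'I': 0.5},        # t_ij, with begin state B
         'E': {'E': 0.8, 'I': 0.2},
         'I': {'E': 0.3, 'I': 0.7}}
EMIT = {'E': {'A': 0.4, 'C': 0.1, 'G': 0.1, 'T': 0.4},   # e_ix
        'I': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1}}

def forward(obs):
    """P(O | model) via f_l(i+1) = e_{l,o_{i+1}} * sum_k f_k(i) * t_{k,l};
    O(L * N^2) time instead of summing over all N^L paths."""
    # initialisation: f_k(1) = t_{B,k} * e_{k,o_1}
    f = {k: TRANS['B'][k] * EMIT[k][obs[0]] for k in STATES}
    for x in obs[1:]:
        f = {l: EMIT[l][x] * sum(f[k] * TRANS[k][l] for k in STATES)
             for l in STATES}
    return sum(f.values())    # no explicit end state in this sketch

def brute_force(obs):
    """Naive sum of P(O, path | model) over all state paths -- exponential."""
    total = 0.0
    for path in itertools.product(STATES, repeat=len(obs)):
        p = TRANS['B'][path[0]] * EMIT[path[0]][obs[0]]
        for i in range(1, len(obs)):
            p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][obs[i]]
        total += p
    return total

O = "ACGT"
print(forward(O), brute_force(O))   # the two values agree
```

for a sequence of length 4 the brute-force sum already visits 2^4 paths; the forward pass touches each (position, state) pair once.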
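appendix: Viterbi decoding from Section 4.4 as a matching sketch, on the same made-up two-state HMM (E = exon-like, I = intron-like; parameters are illustrative assumptions, not from the lecture). The dynamic program replaces the sum of the forward recursion by a max and keeps backpointers, so the most probable hidden state path can be read off by traceback, much like in the alignment algorithms of Module 2.

```python
# Toy two-state HMM (illustrative parameters, not from the lecture).
STATES = ['E', 'I']
TRANS = {'B': {'E': 0.5, 'I': 0.5},
         'E': {'E': 0.8, 'I': 0.2},
         'I': {'E': 0.3, 'I': 0.7}}
EMIT = {'E': {'A': 0.4, 'C': 0.1, 'G': 0.1, 'T': 0.4},
        'I': {'A': 0.1, 'C': 0.4, 'G': 0.4, 'T': 0.1}}

def viterbi(obs):
    """Return (probability of the best path, best state path as a string)."""
    # v[k]: probability of the best path emitting obs so far and ending in k
    v = {k: TRANS['B'][k] * EMIT[k][obs[0]] for k in STATES}
    ptrs = []                                  # one backpointer table per step
    for x in obs[1:]:
        ptr, nv = {}, {}
        for l in STATES:
            best = max(STATES, key=lambda k: v[k] * TRANS[k][l])
            ptr[l] = best
            nv[l] = EMIT[l][x] * v[best] * TRANS[best][l]
        v, ptrs = nv, ptrs + [ptr]
    # traceback from the best final state
    last = max(STATES, key=lambda k: v[k])
    path = [last]
    for ptr in reversed(ptrs):
        path.append(ptr[path[-1]])
    return v[last], ''.join(reversed(path))

prob, path = viterbi("AATTGCGC")
print(path, prob)   # -> EEEEIIII (A/T favour E, G/C favour I here)
```

with these emission tables the decoder labels the A/T-rich prefix with state E and the G/C-rich suffix with state I, which is exactly the kind of compositional-contrast labelling a gene finder exploits.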